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Abstract: Handling skew is one of the major challenges 
in query processing. In distributed computational environ¬ 
ments such as MapReduce, uneven distribution of the data 
to the servers is not desired. One of the dominant measures 
that we want to optimize in distributed environments is com¬ 
munication cost. In a MapReduce job this is the amount of 
data that is transferred from the mappers to the reducers. 
In this paper we will introduce a novel technique for han¬ 
dling skew when we want to compute a multiway join in 
one MapReduce round with minimum communication cost. 
This technique is actually an adaptation of the Shares algo¬ 
rithm [^. 

1. Introduction 

Systems such as Pig or Hive that implement SQL or re¬ 
lational algebra over MapReduce have mechanisms to deal 
with joins where there is significant skew; i.e., certain values 
of the join attribute(s) appear very frequently (see, e.g., 

1^ . These systems use a two-round algorithm, where the 
first round identifies the heavy hitters (HH), those values of 
the join attribute(s) that occur in at least some given frac¬ 
tion of the tuples. In the second round, tuples that do not 
have a heavy-hitter for the join attribute(s) are handled nor¬ 
mally. That is, there is one reduceij^for each key, which is 
associated with a value of the join attribute(s). Since the 
key is not a heavy hitter, this reducer handles only a small 
fraction of the tuples, and thus will not cause a problem of 
skew. For tuples with heavy hitters, new keys are created 
that are handled along with the other keys (normal or those 
for other heavy hitters) in a single MR job. The new keys 
in these systems are created with a simple technique as in 
the following example: 


^In this paper, we use the term reducer to mean the appli¬ 
cation of the Reduce function to a key and its associated 
list of values. It should not be confused with a Reduce task, 
which typically executes the Reduce function on many key 
and their associated values. 


Example 1.1. We have to eompute the join R{A, B) txi 
S{B, C) using a given number, k, of reducers. Suppose value 
b for attribute B is identified as a heavy hitter. Suppose 
there are r tuples of R with B = b and there are s tuples 
of S with B — b. Suppose also for convenience that r > 
s. The distribution to k buckets/reducers is done in earlier 
approaches by partitioning the data of one of the relations 
in k buckets (one bucket for each reducer) and sending the 
data of the other relation to all reducers. Of course since 
r > 8, it makes sense to choose relation R to partition. Thus 
values of attribute A are hashed to k buckets, using a hash 
function h, and each tuple of relation R with B = b is sent 
to one reducer - the one that corresponds to the bucket that 
the value of the first argument of the tuple was hashed. The 
tuples of S are sent to all the k reducers. Thus the number 
of tuples transferred from mappers to reducers is r -\- ks. 

The approach described above appears not only in Pig and 
Hive, but dates back to [^. The latter work, which looked at 
a conventional parallel implementation of join rather than 
a MapReduce implementation, uses the same (non-optimal) 
strategy of choosing one side to partition and the other side 
to replicate. In particular, these techniques are not opti¬ 
mal with respect to communication cost (i.e., the number of 
inputs transferred from the mappers to the reducers EB- 

Our contribution: In Example |1.2| we show how we can 
do significantly better than the standard technique of Ex¬ 
ample the rest of the paper we show how the idea in 

Example |1.2| can be extended to apply on any multiway join 
and for any number of heavy hitters. In particular, we show 
how to adapt Shares algorithm to find a solution that 
minimizes communication cost in the case there are heavy 
hitters. 

Example 1.2. We take again the join R{A, B) txi S{B,C). 
We partition the tuples of R with B = b into x groups and 
we also partition the tuples of S with B = b into y groups, 
where xy = k. We use one of the k reducers for each pair 
{i,j) for a group i from R and for a group j from S. Now 
we are going to partition tuples from R and S and we use 
hash functions hr and ha to do the partitioning. We send 
each tuple (a,b) of R to all reducers of the form (i,q), where 
i = hr (a) is the group in which tuple (a,b) belongs and q 
ranges over all y groups. Similarly, we send each tuple (6, a) 
of R to all reducers of the form {q, i), where i = hs{a) is the 
group in whieh tuple {b, a) belongs and q ranges over all x 
groups. Thus each tuple with B — b from R is sent to y 
reducers and each tuple with B = b from S is sent to x 
reducers. Hence the communication cost is ry + sx. We 




can show (see B that by minimizing ry + sx under the 
constraint xy = k we achieve communication cost equal to 
\/2 fcrs, which is always less than what we found in Exam¬ 


ple 1.1 which was r-\-ks. The proof is easy: 'f2krs < r + fcs 


or 0 < \Jr js — + k\J sjr, which is a second order poly¬ 

nomial wrto Vk as unknown and it is positive for any k. 
Moreover observe that the improvement is significant: The 
optimal communication cost grows as y/k, while r + fcs grows 
linearly with k. 


Related Work There is a lot of work over the decades 
about how to handle skew when we process queries. We 
will limit ourselves here to recent work that considers joins 
in MapReduce or discusses the Shares algorithm. In it 
is proven that with high probability the Shares algorithm 
distributes tuples evenly on uniform databases (these are 
defined precisely in to be databases which resemble the 
case of random datciyT Then, generalizes and enhances 
results in and describes how the Shares algorithm 

behaves on skewed data: it shows that the algorithm is re¬ 
silient to skew, and gives an upper bound even on skewed 
databases. However this resilience applies to ordinary joins 
that use many of the attributes in one relation allowing thus 
the tuples with a heavy hitter to be distributed in many 
reducers. However this is not the case in the 2-way join 
example we gave - and many others. 


2. Shares Algorithm 

The algorithm is based on a schema according to which we 
distribute the data to a given number of k reducers. Each 
reducer is defined by a vector, where each component of the 
vector corresponds to an attribute. The algorithm uses a 
number of independently chosen random hash functions hi 
one for each attribute Xi. Each tuple is sent to a num¬ 
ber of reducers depending on the value of hi for the spe¬ 
cific attribute Xi in this tuple. If Xi is not present in 
the tuple, then the tuple is sent to all reducers for all hi 
values. For an example, suppose we have the 3-way join 
Ri{Xi, X 2 ) [XI R 2 {X 2 , X 3 ) rx R 3 {X 3 , Xi). In this example 
each reducer is dehned by a vector (x,y,z). A tuple (a, &) 
of R\ is sent to a number of reducers and specifically to re¬ 
ducers {hi{a),h 2 {b),i) for all i. I.e., this tuple needs to be 
replicated a number of times, and specifically in as many 
reducers as is the number of buckets into which hs hashes 
the values of attribute X 3 . 

When the hash function hi hashes the values of attribute 
Xi to Xi buckets, we say that the share of Xi is Xi. The 
communication cost is calculated to be, for each relation, the 
size of the relation times the replication that is needed for 
each tuple of this relation. This replication can be calculated 
to be the product of the shares of all the attributes that do 
not appear in the relation. In order to keep the number of 
reducers equal to k, we need to calculate the shares so that 
their product is equal to k. 

Thus, in our example, the communication cost is riX 3 -\- 
r2Xi -\- r3X2 and we must have X1X2X3 = k. (We denote the 
size of a relation Ri hy ri.) In [^, it is explained how to use 
the Lagrangean method to find the shares that minimize the 
communication cost, for any multiway join. 

We are going to need an important observation that was 


proven in An attribute A is dominated by attribute B 
in the join if B appears in all relations where A appears. It 
is shown that if an attribute is dominated, then it does not 
get a share, or, in other words, its share is equal to 1 . 


2.1 Our Setting 

We saw how to compute the 2-way join in Example 1 1.2 1 for 
the tuples which have one HH. For this join, we took two 
sets of keys: 

• The set of keys as presented in Example |1.2| which send 
tuples with HH to a number of reducers in order to 
compute the join of tuples with HH. 

• The set of keys which send tuples without HH to a num¬ 
ber of reducers in order to compute the join of tuples 
without HH. This second set is formed exactly as in the 
Shares algorithm. 

It is convenient to see these two sets of keys as corresponding 
to two joins which we call residual joins, and which actually 
differ only on the subset of the data they are applied. One 
applies the original join on the data with HH and the other 
applies the original join on the data without HH. 

The method we presented in Example |1.2| is actually based 
on the Shares algorithm. To see this, we can be equivalently 
thought as: We replace each tuple of relation R with a tu¬ 
ple where B has distinct fresh values 61 , & 2 ,... and the same 
for the tuples of relation S with B having values 62 , ■ ■ 
Now we can apply the Shares algorithm to find the shares 
and distribute the tuples to reducers normally. The only 
problem with this plan is that the output will be empty be¬ 
cause we have chosen biS and b(s to be all distinct. This 
problem however has an easy solution, because, we can keep 
this replacement to the conceptual level, just so to create 
a HH-free join and be able to apply the Shares algorithm 
and compute the shares optimally. When we transfer the 
tuples to the reducers, however, we transfer the original tu¬ 
ples and thus, we produce the desired output. We explain 
this conceptual structure in Section]^ 

Our setting is as follows: We have k reducers to use for 
computing all residual joins. We assume each residual join 
Ji uses ki of those reducers, thus one constraint is fci -|- ^2 + 
• • • = k. For each residual join, we need to compute the 
communication cost expression. The objective function to 
minimize is the sum of the cost expressions over all residual 
joins, under the constraint: for each residual join Ji, the 
product of the attribute shares must be equal to ki. 

The aim of this paper is to show how to systematically apply 
the idea explained for the 2 -way join on any multiway join 
with any number of HH. The structure of the rest of the 
paper is the following: 

1. We decompose into residual joins, i.e., we partition the 
data into subsets and we view a residual join as the 
original join applied on one of the subsets (Section 3). 

2. We explain how to form a HH-free residual join and 
how to compute the communication cost expression for 
each residual join (Section 4). 

3. We show how the cost expression for each residual join is 
written in a simple and effective way (Section 5). 





3. Decomposition wrto HH 

First we need some definitions. 

For each attribute Xi we define a set Lxi of types: 

• If Xi has no values that are heavy hitters, then Lxi 
comprises of only one type, T_, called the ordinary 

typeu 

• If Xi has Pi values that are heavy hitters, then Lxi 
comprises of 1 + pi types: one type Tb for each heavy 
hitter, 6 , of Xi, and one ordinary type T-. 

A combination of types, Ct, is an element of the Cartesian 
product of the sets Lxi,i = 1 , 2 ... and defines a residual 
join. 

E.g., for the query in in Example |1.2[ we consider two resid¬ 
ual joins, one for type combination Ct = {A : T-,B : 
T-,C : r_} (without HH) and one for type combination 
Ct = {A:T-,B ■.Tb,C : T-} (with HH). 

Each Ct defines a residual join which is the join computed 
only on a subset of the data. Specifically, if an attribute X 
has ordinary type in the current Ct we exclude the tuples 
for which X = HB:. E.g., if there are two HH X — bi and 
X = b 2 , then we exclude (from all relations) all tuples with 
X = bi and X = b 2 . If attribute X is of type Tb then we 
exclude (from all relations) the tuples with value X ^ b. 

Example 3.1. We take as our running example the 3- 
way join: J = R{A, B) cxi S{B, E, C) lxi T{C, D) 

Suppose attribute B has two HHs, B = bi and B — b 2 and 
attribute C has one HH, and C = c\. Thus attribute B 
has three types, T-, Tb^ and Tb 2 , attribute C has two types, 
T- and Tc^ and the rest of the attributes have a single type, 
T-. Thus we have 3x2 = 6 residual joins, one for each 
combination. By r, s, t we denote the sizes of the relations 
that are relevant in each residual join, i.e., the number of 
tuples from each relation that contribute in the particular 
residual join. We list the residual joins: 

1. All attributes of type T-. Here r is the number of only 
those tuples of relation R for which B ^ bi and B yf 62 , 
s IS the number of only those tuples of relation S for 
which B ^ bi and B 7 ^ 62 and C ^ C\, and t is the 
number of those tuples in relation T for which C ^ c\. 

2. All attributes of type T_, except B of typle Tb^. 

3. All attributes of type T_, except B of typle Tb^. 

4 . All attributes of type T_, except C of typle Tc^. 

5. All attributes a of type T_, except B of type Tb^ and C 
of typle . 

6 . All attributes of type T-, except B of type Tb^ and C of 
typle Tci. 

Each residual join is treated by the Shares algorithm as a 
separate join and a set of keys are defined that hash each 
tuple as follows: A tuple t of relation Rj is sent to reducers 
of combinatrion Ct only if the values of the tuple satisfy the 
constraints of Ct as concerns values of HH. 

Example 3.2. We continue from Example \3.1\ Each tu¬ 
ple is sent to a number of reducers according to the keys 
created for each residual join(we will provide more details 
later in the paper). E.g., a tuple t from relation R is sent to 
reducers as follows: _ 

^Ordinary type represents all other values of attribute Xi, 
the ones that are not heavy hitters. 


1 . 


2 . 


3. 


If t has B — bi then it is sent to reducers created in 
items (2) and (5) in Example 


3.1 


If t has B bi and B yf 62 then it is sent to reducers 
created in items (1) and (4). 

If t has B = 62 then it is sent to reducers created in 
items (3) and (6). 


4. Writing the Cost Expression 


In this section we will explain how to form a HH-free residual 
join and how to compute the communication cost expression 
for each residual join. The structure we use in this section 
is conceptual, for the sake of showing how to write the cost 
expression. In practice, we do not materialize R{A, Br) and 
S{Bs,C) or the auxiliary relation (definitions of these will 
be given shortly) - as we will explain in the next section. 
We begin with an example: 


Example 4.1. We consider the residual join with HH B = 
b for the join of Example ] 1. 2l which we rewrite here: R{A, B) [xi 
S{B,C). In order to do that, we will eguivalently imagine 
that we have to compute: 

1. A join R{A,Bii) cx] Ra^x{BR,Bs) ix S{Bs, C) 

2. On new database D' which comes from D: We populate 
R(A, Br) with the same number of tuples as the origi¬ 
nal R(A,B) has: For each tuple t with B = b, we add 
in R{A, Br) a tuple where we have replaced the B = b 
with Br = b.t.R. We do similarly for S{Bs,C) re¬ 
placing B = b in tuple t by Bs = b.t.S. The relation 
Raux{BR, Bs) is the Cartesian product of Br and Br. 

• Observation 1: Database D' now has no heavy hitters. 
So, we can apply the original Shares algorithm. The 
introduction of auxiliary attributes and relations may 
seem now as if complicates things significantly, but, as 
we shall show in Section^ it does not. 

Thus if, in the database D, relation R is {(1, 2), (3, 2), (4, 2)} 
and S is {(2, 5), (2, 6)} then, in the database D' we have (as¬ 
suming B — 2 qualifies for HH): 


R{A, Br) is {(1, 2.1.B), (3, 2.3.B), (4, 2.4.B)}. 

S{Bs,C) is {(2.5.S,5),(2.6.5,6)}. 

(I.e., we conveniently identify the tuple of R with the value 
of its first argument and the tuple of S with the value of its 
second argument.) 

The auxiliary relation Raux{BR, Bs) is : 

{{2.1.R, 2.5.S), (2.3.B, 2.5.S), (2.4.B, 2.5.5), 

(2.1.B, 2.6.5), (2.3.B, 2.6.5), (2.4.B, 2.6.5)} 

The residual join computation has no heavy hitters, thus, 
we apply the original Shares algorithm, only that, when we 
compute the cost expression we ignore the communication 
cost for the auxiliary relotjon[^ Thus the communication 
cost of the residual join is again ry -\- sx, which is the same 
expression as in Example \l.l^ 

The conceptual structure in the general case is as follows: 
For each combination of types, Ct, we compute a HH-free 
residual join whose cost expression is written as follows: 


^We can igonore it because we know what are the tuples 
in the auxiliary relation and we can imagine that we can 
recreate them in the reducers. 





1. If attribute Xi has non-ordinary type in Ct then: 

-We introduce a number of auxiliary attributes, one 
auxiliary attribute for each relation Rj where attribute 
Xi appears. We denote the auxiliary attribute for rela¬ 
tion Rj by Xi-R. . 

-In the schema of each relation Rj where Xi appears, 
we replace Xi with attribute Xi-R. . 

2. We form the residual join J' for Ct by adding to orig¬ 
inal join new relations as follows: one relation, Ra-ix, 
for each attribute Xi which is not of ordinary type. 
The schema of that relation consists of the attributes 
Xi-R. for each j such that Xi is an attribute of Rj. 

3. Now we write the communication cost expression for J' 
as in the Shares algorithm taking care that: 

a. The communication cost expression does not in¬ 
clude a term for auxiliary relations. 

b. The size of each relation in J' that we use in the 
cost expression is the number of tuples that have 
as values in the arguments with heavy hitters the 
specific value for this combination of types. 

Now we will discuss in the next subsection how (and why) 
to simplify the cost expression not to indue share variables 
for the auxiliary attributes. 

5. Dominance Relation: Its Role in Sim¬ 
plifying the Cost Expression 

The property of the dominance relation allows us to write 
the cost expression for each residual join in a simple manner. 
We use the theorem: 

Theorem 5.1. The share of each auxiliary attribute is 
equal to 1 in the optimum so/wtion[^ 

Proof. Each auxiliary attribute appears in one relation 
of the original join and in one auxiliary relation. Since we 
do not add a term in the cost expression for the auxiliary 
relation, we imagine that we write the cost expression for a 
join which is the residual join without the auxiliary relations. 
Hence, an auxiliary attribute appears only in one relation, 
hence it is dominated by a ordinary (non-HH in this residual 
join) attribute. There is the exception: when all attributes 
in a relation are auxiliary attributes. In this case there is 
only one tuple in the relation in this particular residual join, 
so all attributes in the relation get share =1. □ 

Thus we established that: 

• The cost expression for each residual join can be de¬ 
rived from the cost expression of the original join (before 
dominance rule simplification) by making the shares of 
auxiliary attributes equal to 1. 

• Each tuple is hashed to reducers according to the values 
of the non-HH attributes in this tuple. 

Example 5.2. We continue from Example \3.1\ for the same 
HH as there. Remember by a, h, c, d, e we denote the shares 
for each attribute A, B, C, D, E respectively and by r, s, t we 

^Sometimes, we have a tie where in a relation all attributes 
appear only once; in this case we break ties declaring always 
the auxiliary attribute as dominated. 


denote the sizes of the relations that are relevant in each 
residual join, i.e., the number of tuples from each relation 
that contribute in the particular residual join. We always 
start with the cost expression for the original join, rede -\- 
sad -\- tabe, and then simplify accordingly. We list the cost 
expression for every residual join (and in the same order as) 
in Example \3.1\ 

1. Here all attributes are ordinary, so we simplifly the re¬ 
lation by observing that A is dominated by B and D 
is dominated by C, hence a = 1 and d = 1 and the 
expression is: 

rc -I- s -I- tb. 

2. Here B is a non-ordinary attribute, hence 6=1 and 
then, from the remaining attributes only D is dominated 
by C, hence d = 1 and the expression is: rc -|- sa -|- ta 

3. rc -\- sa -\- ta, i.e., same as above, only the sizes of the 
relations will be different. 

4 . rd-\- sd-\- tb 

5. Here we set both 6=1 and c = 1 and this gives us 
rde -\- sad -\- tae. 

6. rde -|- sad -\- tae, i.e., same as above, only the sizes of 
the relations will be different. 
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